
[TEMP]feat: add torch profiler for the function update_weights#10

Merged
JensenFire merged 5 commits into JD-ETH:jd/rdma-integration from JensenFire:jsf/pr3-torch_profiler
Jan 7, 2026

Conversation

@JensenFire
Collaborator

Add a torch profiler for the update_weights part, with the flag --use-pytorch-profiler-update-weight. Users can find the file update_weights_call{step}_rank_{gpu_id}.pt.trace.json.gz. Open it with https://ui.perfetto.dev/ and the result looks something like:

[Screenshot: Perfetto trace view of the update_weights profiling output]

@JensenFire JensenFire requested review from JD-ETH and Risc-lt January 5, 2026 09:04
Owner

@JD-ETH JD-ETH left a comment


Vibe coding is convenient, and often even elegant, lol. But it tends to be very verbose and redundant, with fail-safes; let's try to be concise and careful here, and prioritize human readability first. It's better to throw errors and crash runs immediately for the profiling.

parser.add_argument(
"--profile-update-weight-start",
type=int,
default=0,
Owner


let's just have the start default to 1, and log everything afterwards? We only do 3 training steps anyways and this profiler will only be used for our profiling configs.

Collaborator Author


you mean we could skip the first step?

)
parser.add_argument("--check-weight-update-equal", action="store_true")
parser.add_argument(
"--use-pytorch-profiler-update-weight",
Owner


does this impact performance of the run? If not, let's just set it default on

Collaborator Author


If we're not targeting merging this feature into the official slime repo, I think it's ok. I'll also tag this PR with [TEMP] and we'll revert it in the end.

Owner


aight

default=0,
help="After enabling PyTorch profiler for weight update operations, start profiling from this point. Requires --tensorboard-dir to be set.",
)
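[Editor's note] The argument fragments quoted in this thread are scattered diff context. As a hedged reconstruction only: the flag names below come from the visible diff, but the action types, defaults (including the reviewer's suggested start default of 1), and help texts are assumptions, not the PR's actual code.

```python
# Hypothetical consolidation of the flags discussed in this review thread.
import argparse

parser = argparse.ArgumentParser()
# Assumed to be a boolean switch; the diff context does not show its action type.
parser.add_argument(
    "--use-pytorch-profiler-update-weight",
    action="store_true",
    help="Profile only the update_weights path with torch.profiler.",
)
parser.add_argument(
    "--profile-update-weight-start",
    type=int,
    default=1,  # per the review suggestion: start at 1 and log everything afterwards
    help="Start profiling weight updates from this step. Requires --tensorboard-dir.",
)
```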
parser.add_argument(
Owner


and I'd prefer we remove this

Collaborator Author

@JensenFire JensenFire Jan 6, 2026


or keep it, just like the profiler of slime itself?

Owner


oh they have this also?

Collaborator Author

@JensenFire JensenFire Jan 7, 2026


This is the interesting part. Slime and Megatron-LM share the same argument use-pytorch-profiler (/root/Megatron-LM/megatron/training/arguments.py). We can also find its usage in TrainProfiler in /root/slime/slime/utils/profile_utils.py:

class TrainProfiler:
    def __init__(self, args):
        self.args = args
        self._torch_profiler_overall = None
        self._memory_profiler_overall = None

        if args.use_pytorch_profiler and ("train_overall" in args.profile_target):
            self._torch_profiler_overall = _create_torch_profiler(args, name="train_overall")

        if args.record_memory_history and ("train_overall" in args.profile_target):
            self._memory_profiler_overall = _BaseMemoryProfiler.create(args)
            self._memory_profiler_overall.start()

Basically, when --use-pytorch-profiler is enabled, it records all Python functions between steps, and the trace gets quite large (>100MB) if we want the mapping between Python functions and CPU/GPU occupancy. Besides that, there are too many redundant parts in this profiler, since all we care about is updating weights. That's why I created a separate function-scoped profiler.
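[Editor's note] A minimal sketch of what such a function-scoped profiler could look like, assuming torch is available; this is not the PR's implementation. torch.profiler's tensorboard_trace_handler with use_gzip=True already emits *.pt.trace.json.gz traces, so wrapping just the update_weights call keeps traces small. The helper name and step-gating parameter are illustrative assumptions:

```python
# Hedged sketch, not the PR's actual code: profile only the update_weights
# call instead of the whole training loop.
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

def profile_update_weights(fn, step, out_dir, start_step=1):
    """Run `fn`; from `start_step` onward, run it under torch.profiler.

    tensorboard_trace_handler(..., use_gzip=True) writes traces named like
    <worker>.<timestamp>.pt.trace.json.gz into `out_dir`, openable at
    https://ui.perfetto.dev/. The PR's own per-step/per-rank naming
    (update_weights_call{step}_rank_{gpu_id}) is not reproduced here.
    """
    if step < start_step:
        return fn()
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(
        activities=activities,
        on_trace_ready=tensorboard_trace_handler(out_dir, use_gzip=True),
    ):
        return fn()
```

Compared with the global --use-pytorch-profiler, this only captures the region of interest, which is what keeps the trace well under the >100MB sizes mentioned above.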

@JensenFire
Collaborator Author

Updated @JD-ETH


@JensenFire JensenFire changed the title feat: add torch profiler for the function update_weights [TEMP]feat: add torch profiler for the function update_weights Jan 7, 2026
@JensenFire
Collaborator Author

Based on the comments, I've marked this PR as [TEMP]; it will be reverted in the future.

@JensenFire JensenFire merged commit a6715ba into JD-ETH:jd/rdma-integration Jan 7, 2026
1 check passed
Risc-lt pushed a commit that referenced this pull request Jan 28, 2026
Add torch profiler for the update_weights part, with the tag `--use-pytorch-profiler-update-weight`. Users could find the file `update_weights_call{step}_rank_{gpu_id}.pt.trace.json.gz`. Open it with `https://ui.perfetto.dev/`